## [1] 4898 13
## [1] "X" "fixed.acidity" "volatile.acidity"
## [4] "citric.acid" "residual.sugar" "chlorides"
## [7] "free.sulfur.dioxide" "total.sulfur.dioxide" "density"
## [10] "pH" "sulphates" "alcohol"
## [13] "quality"
## 'data.frame': 4898 obs. of 13 variables:
## $ X : int 1 2 3 4 5 6 7 8 9 10 ...
## $ fixed.acidity : num 7 6.3 8.1 7.2 7.2 8.1 6.2 7 6.3 8.1 ...
## $ volatile.acidity : num 0.27 0.3 0.28 0.23 0.23 0.28 0.32 0.27 0.3 0.22 ...
## $ citric.acid : num 0.36 0.34 0.4 0.32 0.32 0.4 0.16 0.36 0.34 0.43 ...
## $ residual.sugar : num 20.7 1.6 6.9 8.5 8.5 6.9 7 20.7 1.6 1.5 ...
## $ chlorides : num 0.045 0.049 0.05 0.058 0.058 0.05 0.045 0.045 0.049 0.044 ...
## $ free.sulfur.dioxide : num 45 14 30 47 47 30 30 45 14 28 ...
## $ total.sulfur.dioxide: num 170 132 97 186 186 97 136 170 132 129 ...
## $ density : num 1.001 0.994 0.995 0.996 0.996 ...
## $ pH : num 3 3.3 3.26 3.19 3.19 3.26 3.18 3 3.3 3.22 ...
## $ sulphates : num 0.45 0.49 0.44 0.4 0.4 0.44 0.47 0.45 0.49 0.45 ...
## $ alcohol : num 8.8 9.5 10.1 9.9 9.9 10.1 9.6 8.8 9.5 11 ...
## $ quality : int 6 6 6 6 6 6 6 6 6 6 ...
## X fixed.acidity volatile.acidity citric.acid
## Min. : 1 Min. : 3.800 Min. :0.0800 Min. :0.0000
## 1st Qu.:1225 1st Qu.: 6.300 1st Qu.:0.2100 1st Qu.:0.2700
## Median :2450 Median : 6.800 Median :0.2600 Median :0.3200
## Mean :2450 Mean : 6.855 Mean :0.2782 Mean :0.3342
## 3rd Qu.:3674 3rd Qu.: 7.300 3rd Qu.:0.3200 3rd Qu.:0.3900
## Max. :4898 Max. :14.200 Max. :1.1000 Max. :1.6600
## residual.sugar chlorides free.sulfur.dioxide
## Min. : 0.600 Min. :0.00900 Min. : 2.00
## 1st Qu.: 1.700 1st Qu.:0.03600 1st Qu.: 23.00
## Median : 5.200 Median :0.04300 Median : 34.00
## Mean : 6.391 Mean :0.04577 Mean : 35.31
## 3rd Qu.: 9.900 3rd Qu.:0.05000 3rd Qu.: 46.00
## Max. :65.800 Max. :0.34600 Max. :289.00
## total.sulfur.dioxide density pH sulphates
## Min. : 9.0 Min. :0.9871 Min. :2.720 Min. :0.2200
## 1st Qu.:108.0 1st Qu.:0.9917 1st Qu.:3.090 1st Qu.:0.4100
## Median :134.0 Median :0.9937 Median :3.180 Median :0.4700
## Mean :138.4 Mean :0.9940 Mean :3.188 Mean :0.4898
## 3rd Qu.:167.0 3rd Qu.:0.9961 3rd Qu.:3.280 3rd Qu.:0.5500
## Max. :440.0 Max. :1.0390 Max. :3.820 Max. :1.0800
## alcohol quality
## Min. : 8.00 Min. :3.000
## 1st Qu.: 9.50 1st Qu.:5.000
## Median :10.40 Median :6.000
## Mean :10.51 Mean :5.878
## 3rd Qu.:11.40 3rd Qu.:6.000
## Max. :14.20 Max. :9.000
The middle 50% of wines have a fixed acidity between 6.3 and 7.3 g/L, a volatile acidity between 0.21 and 0.32 g/L, and a citric acid content between 0.27 and 0.39 g/L. Mean residual sugar is 6.391 g/L. Density falls within a relatively tight range of 0.9871 to 1.039 g/cm^3. The middle 50% of wines have a pH between 3.09 and 3.28. Alcohol percentage varies from 8% to 14.2%. Wine quality varies between 3 and 9, with a median of 6 and a mean of 5.878.
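The "middle 50%" statements come from the 1st and 3rd quartiles in the summary. A minimal sketch of that calculation, using only the first ten `fixed.acidity` values printed by `str()` above (so the numbers will not match the full-data quartiles); the variable names are my own:

```r
# First ten fixed.acidity values, copied from the str() output above.
# Ten rows cannot reproduce the full-data quartiles (6.3 and 7.3 g/L).
fixed_acidity <- c(7, 6.3, 8.1, 7.2, 7.2, 8.1, 6.2, 7, 6.3, 8.1)

# The 25th and 75th percentiles bound the middle 50% of observations.
middle_half <- quantile(fixed_acidity, probs = c(0.25, 0.75))

# IQR() returns the width of that interval directly.
iqr_width <- IQR(fixed_acidity)

print(middle_half)
print(iqr_width)
```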
## [1] 10.3 10.3 10.7 10.7 11.8 14.2
Fixed acidity appears to have a relatively normal distribution around the mean of 6.855 g/L. There appear to be some outliers at very high fixed acidity, around 11.8 and 14.2 g/L. I wonder if a high level of fixed acidity affects the wine quality.
## [1] 0.780 0.785 0.815 0.850 0.905 0.910 0.930 0.965 1.005 1.100
Volatile acidity appears to have a right-skewed distribution around the median of 0.26 g/L. There are a number of higher values above 0.70 g/L. Plotted on a log10 scale, volatile acidity shows a near-bimodal distribution with peaks at 0.18 and 0.3 g/L. I wonder if a high level of volatile acidity affects the wine quality, or if the ratio of acidities affects the wine quality. Below I have plotted the ratios of acidities in histograms.
In the bivariate plots section, I will compare the above plots. It appears that citric acid and volatile acidity have common ratios occurring at 1 and 2.
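As a rough sketch of how such a ratio can be computed and binned, here are the first ten rows from the `str()` output above. The column name `citric.to.volatile` and the bin width are assumptions of mine, not necessarily what was used for the plots:

```r
# First ten rows of the two acidity columns, copied from the str() output.
wine_head <- data.frame(
  volatile.acidity = c(0.27, 0.3, 0.28, 0.23, 0.23, 0.28, 0.32, 0.27, 0.3, 0.22),
  citric.acid      = c(0.36, 0.34, 0.4, 0.32, 0.32, 0.4, 0.16, 0.36, 0.34, 0.43)
)

# Hypothetical column name: ratio of citric acid to volatile acidity.
wine_head$citric.to.volatile <- wine_head$citric.acid / wine_head$volatile.acidity

# Bin the ratio without drawing; the 0.25 bin width is an arbitrary choice.
ratio_hist <- hist(wine_head$citric.to.volatile,
                   breaks = seq(0, 2.5, by = 0.25), plot = FALSE)

print(round(wine_head$citric.to.volatile, 2))
print(ratio_hist$counts)
```

Ten rows are far too few to show the local modes at 1 and 2; the sketch only illustrates the mechanics.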
## [1] 1.00 1.00 1.00 1.00 1.23 1.66
Citric acid levels appear to have a relatively normal distribution around the median of 0.32 g/L. There appear to be some outliers with high citric acid levels, around 1.23 and 1.66 g/L. There is also a local mode at 0.49 g/L. This could be a regulated addition for certain types of wines or wineries. It would be interesting to see the quality of the wines at the local mode of 0.49 g/L.
## [1] 23.50 26.05 26.05 31.60 31.60 65.80
Residual sugar levels appear to have a strongly right-skewed distribution with a long tail. The median is 5.2 g/L and the mean is 6.391 g/L. Plotting residual sugar on a log10 scale shows a bimodal distribution with peaks around 2 and 10 g/L. There appear to be some outliers at 31.60 and 65.80 g/L.
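A minimal sketch of binning a long-tailed variable on a log10 scale, again using only the first ten `residual.sugar` values from the `str()` output. The break points are my own choice, and the full data would need breaks that extend below 1 g/L:

```r
# First ten residual.sugar values (g/L), copied from the str() output.
residual_sugar <- c(20.7, 1.6, 6.9, 8.5, 8.5, 6.9, 7, 20.7, 1.6, 1.5)

# Bin on the log10 scale so the long right tail is compressed.
log_sugar <- log10(residual_sugar)
log_hist <- hist(log_sugar, breaks = seq(0, 2, by = 0.25), plot = FALSE)

# Report each bin's lower edge back in g/L for readability.
bins <- data.frame(
  lower.g.per.L = round(10^head(log_hist$breaks, -1), 2),
  count         = log_hist$counts
)
print(bins)
```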
## [1] 0.244 0.255 0.271 0.290 0.301 0.346
The chloride levels show a slightly bimodal distribution with peaks around 0.036 and 0.046 g/L, and a very long tail. Outliers run up to 0.346 g/L. Can the bimodal distribution be attributed to a correlation with another variable?
Free sulfur dioxide has a near-normal distribution centered around 30 mg/L. The histogram for free sulfur dioxide has a proportionally long tail, running to 289 mg/L. My research shows that free SO2 is detectable by sensitive tasters, so it would be interesting to see if there is a threshold range at which the quality suffers.
Total sulfur dioxide has a near normal distribution near the median of 134 mg/L. Below the ratio of free sulfur dioxide to total sulfur dioxide is plotted on a histogram:
Below I have plotted density histograms:
Density appears to fall into consistent, stepped ranges: high frequencies between 0.991 and 0.994, mid frequencies between 0.994 and 0.996, and lower frequencies between 0.996 and 0.999. Maybe these densities correspond to alcohol percentages.
pH has a relatively normal distribution, centered around the median of 3.18.
Sulphates have a right-skewed distribution, with a mean of 0.49 g/L and a median of 0.47 g/L. I expect them to be correlated with the sulfur dioxide levels.
Alcohol percentage varies between 8% and 14.2%.
Quality varies between 3 and 9, with a median of 6 and a mean of 5.878.
There are 4898 sampled wines in the dataset with 12 features (fixed acidity, volatile acidity, citric acid, residual sugar, chlorides, free sulfur dioxide, total sulfur dioxide, density, pH, sulphates, alcohol, quality), plus a row index `X`. All features are numeric types except quality, which is an integer. Quality has a range of 3 to 9, with a median of 6 and a mean of 5.878.
The main feature is quality. I am looking to find how the 11 input variables influence the quality of a wine, so I can predict the quality of a wine based on its chemical features.
From the dataset summary, I can start with the hint that high amounts of volatile acidity lead to an unpleasant taste - therefore lowering the quality score. Citric acid is said to add ‘freshness’ and flavor to wines - so may be a good indicator of quality as well. Sulfur Dioxide content may also be a good indicator of quality, as it is detectable in high concentrations and may be unpleasant.
I created total acidity, which is the sum of fixed acidity, volatile acidity and citric acid, and is measured in g/L. This will be useful in the bivariate and multivariate analysis, where I investigate whether ratios of acids and other features affect quality. I also created a factor data type from the quality variable.
I also created a ratio of free sulfur dioxide to total sulfur dioxide. This will also be investigated with relation to quality and sulphate levels.
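A minimal sketch of these derived features on the first five rows from the `str()` output. The new column names (`total.acidity`, `free.to.total.so2`, `quality.factor`) and the choice of an ordered factor are assumptions for illustration:

```r
# First five rows of the relevant columns, copied from the str() output.
wine_head <- data.frame(
  fixed.acidity        = c(7, 6.3, 8.1, 7.2, 7.2),
  volatile.acidity     = c(0.27, 0.3, 0.28, 0.23, 0.23),
  citric.acid          = c(0.36, 0.34, 0.4, 0.32, 0.32),
  free.sulfur.dioxide  = c(45, 14, 30, 47, 47),
  total.sulfur.dioxide = c(170, 132, 97, 186, 186),
  quality              = c(6L, 6L, 6L, 6L, 6L)
)

# Total acidity (g/L): sum of the three acid measurements.
wine_head$total.acidity <- with(
  wine_head, fixed.acidity + volatile.acidity + citric.acid
)

# Share of total sulfur dioxide that is free.
wine_head$free.to.total.so2 <- with(
  wine_head, free.sulfur.dioxide / total.sulfur.dioxide
)

# Quality as a factor over the observed 3..9 range (ordering is an assumption).
wine_head$quality.factor <- factor(wine_head$quality, levels = 3:9, ordered = TRUE)

print(wine_head[, c("total.acidity", "free.to.total.so2", "quality.factor")])
```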
It appears the ratio of citric acid to volatile acidity has local modes at 1 and 2 on the histograms, suggesting that wine makers may be adding these ingredients to reach these ratios. Perhaps this ratio is used to influence the quality of the wine. There also appears to be a spike of results for citric acid at 0.49 g/L. Density appears to fall into consistent, stepped ranges: high frequencies between 0.991 and 0.994, mid frequencies between 0.994 and 0.996, and lower frequencies between 0.996 and 0.999. I will investigate whether these ranges can be attributed to ranges in another feature.
I plotted all features with non-normal or long-tailed distributions on a log10 scale. Volatile acidity and residual sugar were found to have bimodal distributions on this scale. I will investigate this in the following analysis.
## fixed.acidity volatile.acidity citric.acid
## fixed.acidity 1.00000000 -0.02269729 0.289180698
## volatile.acidity -0.02269729 1.00000000 -0.149471811
## citric.acid 0.28918070 -0.14947181 1.000000000
## residual.sugar 0.08902070 0.06428606 0.094211624
## chlorides 0.02308564 0.07051157 0.114364448
## free.sulfur.dioxide -0.04939586 -0.09701194 0.094077221
## total.sulfur.dioxide 0.09106976 0.08926050 0.121130798
## density 0.26533101 0.02711385 0.149502571
## pH -0.42585829 -0.03191537 -0.163748211
## sulphates -0.01714299 -0.03572815 0.062330940
## alcohol -0.12088112 0.06771794 -0.075728730
## quality -0.11366283 -0.19472297 -0.009209091
## residual.sugar chlorides free.sulfur.dioxide
## fixed.acidity 0.08902070 0.02308564 -0.0493958591
## volatile.acidity 0.06428606 0.07051157 -0.0970119393
## citric.acid 0.09421162 0.11436445 0.0940772210
## residual.sugar 1.00000000 0.08868454 0.2990983537
## chlorides 0.08868454 1.00000000 0.1013923521
## free.sulfur.dioxide 0.29909835 0.10139235 1.0000000000
## total.sulfur.dioxide 0.40143931 0.19891030 0.6155009650
## density 0.83896645 0.25721132 0.2942104109
## pH -0.19413345 -0.09043946 -0.0006177961
## sulphates -0.02666437 0.01676288 0.0592172458
## alcohol -0.45063122 -0.36018871 -0.2501039415
## quality -0.09757683 -0.20993441 0.0081580671
## total.sulfur.dioxide density pH
## fixed.acidity 0.091069756 0.26533101 -0.4258582910
## volatile.acidity 0.089260504 0.02711385 -0.0319153683
## citric.acid 0.121130798 0.14950257 -0.1637482114
## residual.sugar 0.401439311 0.83896645 -0.1941334540
## chlorides 0.198910300 0.25721132 -0.0904394560
## free.sulfur.dioxide 0.615500965 0.29421041 -0.0006177961
## total.sulfur.dioxide 1.000000000 0.52988132 0.0023209718
## density 0.529881324 1.00000000 -0.0935914935
## pH 0.002320972 -0.09359149 1.0000000000
## sulphates 0.134562367 0.07449315 0.1559514973
## alcohol -0.448892102 -0.78013762 0.1214320987
## quality -0.174737218 -0.30712331 0.0994272457
## sulphates alcohol quality
## fixed.acidity -0.01714299 -0.12088112 -0.113662831
## volatile.acidity -0.03572815 0.06771794 -0.194722969
## citric.acid 0.06233094 -0.07572873 -0.009209091
## residual.sugar -0.02666437 -0.45063122 -0.097576829
## chlorides 0.01676288 -0.36018871 -0.209934411
## free.sulfur.dioxide 0.05921725 -0.25010394 0.008158067
## total.sulfur.dioxide 0.13456237 -0.44889210 -0.174737218
## density 0.07449315 -0.78013762 -0.307123313
## pH 0.15595150 0.12143210 0.099427246
## sulphates 1.00000000 -0.01743277 0.053677877
## alcohol -0.01743277 1.00000000 0.435574715
## quality 0.05367788 0.43557472 1.000000000
From our correlation coefficients above:

- Fixed acidity is loosely negatively correlated with pH (-0.43).
- Residual sugar is strongly correlated with density (0.84), loosely correlated with total sulfur dioxide (0.40), and loosely negatively correlated with alcohol (-0.45).
- Chlorides are somewhat negatively correlated with quality (-0.21) and alcohol (-0.36).
- Free sulfur dioxide is closely correlated with total sulfur dioxide (0.62).
- Total sulfur dioxide is correlated with density (0.53) and negatively correlated with alcohol (-0.45).
- Density is strongly negatively correlated with alcohol (-0.78) and loosely negatively correlated with quality (-0.31).
- Alcohol and quality are also loosely correlated (0.44).

Below I look at how the different features plot against quality.
From the density functions on the ggpairs plot I see that linear correlation may not be effective in discovering trends in the data. The density functions show multiple peaks and troughs along multiple axes. This data will reveal most of its important information in the multivariate analysis.
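To see at a glance which inputs track quality most closely, the `quality` column of the correlation matrix above can be ranked by absolute value. A small sketch, with the coefficients copied from the matrix and rounded to four decimals (the vector name is my own):

```r
# Pearson correlations with quality, copied from the matrix above.
quality_cor <- c(
  fixed.acidity        = -0.1137,
  volatile.acidity     = -0.1947,
  citric.acid          = -0.0092,
  residual.sugar       = -0.0976,
  chlorides            = -0.2099,
  free.sulfur.dioxide  =  0.0082,
  total.sulfur.dioxide = -0.1747,
  density              = -0.3071,
  pH                   =  0.0994,
  sulphates            =  0.0537,
  alcohol              =  0.4356
)

# Rank by strength of association, ignoring sign.
ranked <- quality_cor[order(abs(quality_cor), decreasing = TRUE)]
print(round(ranked, 2))
```

The top three by magnitude are alcohol, density and chlorides, and none exceeds 0.5 in absolute value.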
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 6.60 6.90 7.10 7.42 7.40 9.10
## [1] 9.1 6.6 7.4 6.9 7.1
Wines with a quality of 9 cluster around a mean fixed acidity of 7.42 g/L. However, with only 5 wines achieving a quality of 9, this may not be the most reliable of statistics. Looking at the trend of fixed.acidity.mean, it appears the wine with a quality of 9 and a fixed.acidity of 9.1 may be the outlier skewing our trend, or perhaps it is a wine style with a different flavour profile. In general, lower quality wines have a higher fixed acidity.
Higher quality wines (quality > 6) appear to have a lower volatile acidity. It also appears from the box plot and from the scatter plot that there may be some grouping in this relationship that could be explored with the multivariate plots.
From the above plots we have an anomaly where there is a high proportion of wines with a citric acid value between 0.49 and 0.5. I will isolate that value by color in a multivariate plot to investigate correlations with other properties as well. The mean by quality plot shows that quality tends to increase with increasing citric acid; however, the citric acid mean for wines with a quality of 3 is mid-range.
In this plot and the above plot it appears the higher quality wines are within a smaller range than the lower quality wines.
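To check how much the single 9.1 g/L wine moves the quality-9 average, here is a quick sketch using the five fixed acidity values printed above (variable names are my own):

```r
# Fixed acidity (g/L) of the five quality-9 wines, from the output above.
q9_fixed_acidity <- c(9.1, 6.6, 7.4, 6.9, 7.1)

mean_all   <- mean(q9_fixed_acidity)
median_all <- median(q9_fixed_acidity)

# Drop the largest value to see its pull on the mean.
mean_without_max <- mean(q9_fixed_acidity[q9_fixed_acidity < max(q9_fixed_acidity)])

print(c(mean = mean_all, median = median_all, mean.without.max = mean_without_max))
```

With so few wines, the median (7.1) and the mean without the 9.1 wine (7.0) are both noticeably below the reported mean of 7.42.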
## Warning: Removed 20 rows containing missing values (geom_point).
Residual sugar shows a definite split along quality, with groups focused at 1.5 g/L and 11 g/L.
Chlorides appear to drop off as quality increases - maybe a parabolic relationship, as there appears to be a peak around qualities of 5.
The plot for quality against the free sulfur dioxide/total sulfur dioxide ratio shows a potential for a linear relationship.
At higher qualities, densities appear to split into two groups, at 0.992 g/mL and 0.997 g/mL. This will be further investigated in the multivariate plots section.
Quality appears to increase with increasing pH.
Sulphates don’t appear to have a very high correlation with quality.
It looks like our judges like higher alcohol wines. Whether this is correlated to another factor will soon be investigated in the multivariate plots section.
It appears that as quality increases, the wines converge on specific values for total sulfur dioxide and acidity. There are also some features that become bimodal in higher quality wines, such as residual sugar. This may be because of differing preferences, or because there are different styles within this type of wine, which may taste different but may all be perceived as high quality. Fixed acidity converges on two different values, volatile acidity narrows at high quality, and citric acid sits within a certain range at high ratings. Residual sugar for high quality wines sits either near zero or near 10 g/dm^3. Chloride levels are low for high quality wines. Free sulfur dioxide converges on two different values, as does total sulfur dioxide. Density converges on two different values. pH converges on around 3.3. Sulphates sit in two specific ranges. Alcohol at higher ratings sits around 10.3 and 12.5 percent. From our correlation matrix, it appears quality is most closely correlated with alcohol, density and chlorides.
Alcohol and density are highly negatively correlated (-0.78). This relationship is expected, as density is used to measure final alcohol content in wines. Residual sugar, alcohol and acetic acid (volatile acidity) also form an interesting relationship, since alcohol and acetic acid are the products of metabolization of sugar by the yeast.
Density and residual sugar had the highest correlation, 0.84. The relationship is clear in the ggpairs plot.
I’ve plotted a scatter plot of residual sugar vs. density, colorizing the points by quality using a diverging color scheme. I chose these variables first due to their high correlation. There is a clear divergence between higher and lower quality wines.
Chlorides and alcohol vs. density also show divergent clusters in high vs. low quality wines.
I do not see a very clear divergence or pattern for chlorides and total.sulfur.dioxide, or for density, chlorides and quality.
I do not see a clear pattern or correlation with these variables either.
The bar chart is really helpful here in showing quality thresholds for each variable. For example, wines with free sulfur dioxide levels above 110 mg/L are almost exclusively low quality. With the histograms we see the frequency of occurrences within these ranges.
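One way to build such a proportion view is to bin the variable and take row-wise shares of each quality score. The sketch below uses a small made-up data frame, so the values, bin edges and names are all hypothetical and only illustrate the mechanics:

```r
# Hypothetical toy data, NOT taken from the wine dataset.
toy <- data.frame(
  free.sulfur.dioxide = c(10, 25, 40, 55, 70, 120, 130, 150),
  quality             = c(5, 6, 7, 6, 5, 3, 4, 3)
)

# Bin the variable; these break points are arbitrary illustration values.
toy$so2.bin <- cut(toy$free.sulfur.dioxide, breaks = c(0, 50, 110, 300))

# Count wines per (bin, quality) cell, then convert each row to proportions.
counts <- table(toy$so2.bin, toy$quality)
shares <- prop.table(counts, margin = 1)

print(round(shares, 2))
# A stacked bar chart of these shares could then be drawn with barplot(t(shares)).
```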
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.000 5.000 6.000 5.974 6.000 9.000
I created a new dataframe of white wines, excluding wines which sit outside the quality thresholds for specific variables. I limited variables to the following ranges:

- free sulfur dioxide between 15 and 110
- fixed acidity below 9
- volatile acidity below 0.6
- total sulfur dioxide between 60 and 210
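A minimal sketch of this kind of threshold filter with `subset()`. The rows are made up so that each exclusion rule is triggered once, and treating the "between" limits as inclusive is an assumption on my part:

```r
# Hypothetical toy rows, NOT taken from the wine dataset.
toy <- data.frame(
  free.sulfur.dioxide  = c(45, 14, 30, 120, 47),
  fixed.acidity        = c(7, 6.3, 9.5, 7.2, 7.2),
  volatile.acidity     = c(0.27, 0.3, 0.28, 0.23, 0.65),
  total.sulfur.dioxide = c(170, 132, 97, 186, 186)
)

# Keep only rows inside every range (inclusive bounds are an assumption).
in_range <- subset(
  toy,
  free.sulfur.dioxide >= 15 & free.sulfur.dioxide <= 110 &
    fixed.acidity < 9 &
    volatile.acidity < 0.6 &
    total.sulfur.dioxide >= 60 & total.sulfur.dioxide <= 210
)

print(in_range)
```

Only the first toy row survives: the others fail on free sulfur dioxide (14 and 120), fixed acidity (9.5) and volatile acidity (0.65) respectively.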
Now in viewing the bar chart and histograms, we can see much clearer patterns. The wines that had chemical amounts exceeding threshold values are excluded.
Above I have plotted residual sugar against density and encoded quality in color, excluding the wines with out-of-threshold values. It appears this has removed some of the higher variance, low quality points, giving us a clearer picture of the relationship.
The clearer picture can also be seen with chlorides vs. sulfur dioxide, chlorides vs. density, and density vs. alcohol.
When residual sugar is plotted vs alcohol the relationship is less obvious, however it can be seen that there is a much higher proportion of high sugar/low alcohol, low quality wines than high sugar / high alcohol wines.
As total sulfur dioxide increases, quality decreases. This was made easier to view by adding the dimension of chloride content.
The interaction between the chlorides and density with quality was made more apparent by plotting them against each other.
Plotting residual sugar vs. density, and colorizing by quality, a divergence became clear when holding either constant. This was seen with alcohol vs. density as well. This is interesting due to the critical interaction of these three variables in the wine making process. As wine ferments, alcohol replaces sugar. Additionally, the density of wine is measured during the wine making process to determine alcohol content.
A really interesting find was the quality thresholds in free sulfur dioxide, fixed acidity, volatile acidity and total sulfur dioxide. Above and below certain thresholds for these chemicals, wine quality suffered immensely.
This distribution is bimodal, and shows a clear difference between a dry vs. a sweet wine.
This plot tells an interesting story. For a given amount of residual sugar, lower density wines have higher quality. It is very improbable to have a high quality wine with a density over 0.995. Also, from this plot we see that higher alcohol wines score higher. Proportionally, a high sugar wine is much more likely to be high quality if it also has a high alcohol content.
This plot shows the proportions of different wine qualities for all other variables. The most interesting parts of this plot are the clear quality thresholds in free sulfur dioxide, fixed acidity, volatile acidity and total sulfur dioxide. When the wine moves outside suitable ranges for these chemicals (very high proportions of 3 and 4 quality wines) it could be argued that the wine is ruined.
This dataset contains nearly 5000 white wines of the Portuguese “Vinho Verde” variety. By observing the histograms of each of the 12 variables, and then plotting each of the eleven input variables against the wine quality, I was able to isolate the variables which were most correlated to quality: alcohol, density and residual sugar. By plotting each variable in proportion bar charts along their respective ranges, I was able to find ranges of chemicals which have very high proportions of poor quality wines. For wine makers, this could be very useful information. If you see a batch of wine testing outside those threshold ranges, it is time to review your recipe or process, for that wine will most likely be of poor quality. For someone picking out a bottle of wine, the only information readily available is the alcohol content, which is printed on the bottle. Residual sugar or sweetness of the wine may be hinted at on the label, as well as whether sulphates were added. From my analysis, a non-expert in wine may find better luck with higher alcohol wines. If one wants a sweeter wine, a higher proportion of high quality wines exists at higher alcohol contents.
I found the bivariate plots section difficult, since the level of correlation between most variables was quite low. It was necessary to add other variables to see any type of pattern. I would have liked to observe the variables over limited ranges, and see if any patterns could be picked out using that technique. It was a good learning experience to plot all variables individually in the univariate and bivariate plot sections; however, not all information was useful. In the future I will rely more on statistical information to guide my investigations into correlation. I will say that simply finding correlations between variables is not enough, as there could be cyclical or non-linear interactions that a simple correlation test will not find. I was very happy to find the relationship between density, residual sugar and quality. An interesting future investigation would be to find the concentration at which different acidities, chlorides and sulphates have a discernible taste, and then colorize the wines in excess of those values to see where they occur on scatter plots of different variables. I think another interesting investigation would be to compare dry vs. sweet wines by splitting the residual sugar variable into two groups.
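As a starting point for that last idea, a two-group split could be sketched with `cut()`. The 4.5 g/L cut point is purely a placeholder of mine (roughly between the two log-scale peaks noted earlier), not a winemaking standard, and the values are the first ten `residual.sugar` entries from the `str()` output:

```r
# First ten residual.sugar values (g/L), copied from the str() output.
residual_sugar <- c(20.7, 1.6, 6.9, 8.5, 8.5, 6.9, 7, 20.7, 1.6, 1.5)

# Hypothetical cut point between the two residual sugar modes.
split_at <- 4.5

# Label each wine as "dry" (at or below the cut) or "sweet" (above it).
sugar_group <- cut(residual_sugar,
                   breaks = c(0, split_at, Inf),
                   labels = c("dry", "sweet"))

print(table(sugar_group))
```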